Fig. 1: Memory interface
XPPPreloadConfig( XppCfg_foo );

for (int i = 0; i < 1000; i++) {

    XPPPreload( 2, &a[i*30], 30 );

    XPPPreload( 0, &b[i*200], 200 );

    XPPPreloadClean( 5, &c[i*10], 10 );

    XPPExecute( );
    /*
       Other RISC computations ...
       Meanwhile, the burst preloads and
       the previous configuration are running.
       The new configuration is executed as soon
       as the preloads and the previous
       configuration are finished.
       New burst preloads can be issued
       according to the FIFO length.
    */

}

Note: in all places where constants are used,
the value should actually come from a register.
Fig. 2: IRAM & configuration cache controller data structures and usage example
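The comment in the listing says that preloads are queued and that new burst preloads can be issued according to the FIFO length. A minimal host-side sketch of that queueing behaviour might look as follows. The types `PreloadReq` and `PreloadFifo`, the depth `FIFO_DEPTH`, and the functions `fifo_issue` and `fifo_retire` are hypothetical names chosen for illustration; they are not part of the XPP interface shown in the figure, and the source does not state a FIFO depth.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 4  /* assumed depth; the source does not give one */

typedef struct {
    int iram;          /* target IRAM number */
    const void *addr;  /* start address of the burst */
    size_t words;      /* burst length in words */
} PreloadReq;

typedef struct {
    PreloadReq slot[FIFO_DEPTH];
    size_t head;   /* index of the oldest pending request */
    size_t count;  /* number of pending requests */
} PreloadFifo;

/* Queue a preload. Returns 1 on success, 0 if the FIFO is full
   (a caller would then have to stall or retry later). */
int fifo_issue(PreloadFifo *f, int iram, const void *addr, size_t words)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH)
        return 0;
    size_t tail = (f->head + f->count) % FIFO_DEPTH;
    f->slot[tail] = (PreloadReq){ iram, addr, words };
    f->count++;
    return 1;
}

/* Retire the oldest request, modelling a finished burst.
   Returns 1 and copies the request to *out, or 0 if none is pending. */
int fifo_retire(PreloadFifo *f, PreloadReq *out)
{
    assert(f != NULL && out != NULL);
    if (f->count == 0)
        return 0;
    *out = f->slot[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```

In this model the RISC side keeps calling `fifo_issue` until it returns 0, which corresponds to the point where no further burst preload can be accepted until an earlier one has completed.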
Fig. 3: Asynchronous pipeline of the XPP
Fig. 4: State transition diagram for the XPP cache controller
Fig. 5: Adding simultaneous multithreading
Fig. 7: Control-flow graph of a piece of program
int *p, b[100];

*p = b;
<uses of b and p>
*p = malloc();
<uses of b and p>
Fig. 8: Example of control-flow sensitivity
for (i=1; i<=N-1; i++)
    for (j=1; j<=N; j++)
        b[i][j] = 0.25 * (a[i-1][j] + a[i][j-1] +
                          a[i+1][j] + a[i][j+1]);
Fig. 10: Example for array merging



PACT45/PCTE 



[Flowchart stages: Code Preparation, Partitioning, XPP Compiler, RISC Code Gen., RISC Code Sched.]
Fig. 11: Global View of the Compiling Process



[Flowchart labels: XPP Loop Opt., Temporal Partitioning, NML Code Gen.; decision "too many fails?" with yes/no branches; edge labels "fail & no change" and "exit"]
Fig. 12: Detailed Architecture of the XPP Compiler
Fig. 13: Detailed View of the XPP Loop Optimization
Fig. 14:

Converter modules for conversion from and to shorter data types. The signed versions,
suffixed with '_sb', perform correct sign extension. All 16-bit converter modules
must be connected to '101010...' event streams, while the '32to8' converters must be
fed with a '10001000...' sequence and the '8to32' converters with a
'00010001...' sequence. All modules output one packet per cycle.
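The data-path effect of these converters can be sketched in plain C. The sketch below only models the arithmetic (sign extension, splitting a 32-bit word into four 8-bit values, and packing four values back); it does not model the event streams. The function names and the byte order (least significant byte first) are assumptions made for the example, not taken from the figure.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend an 8-bit field to 32 bits (the '_sb' behaviour). */
int32_t sext8(uint32_t field)
{
    assert(field <= 0xFFu);
    return (int32_t)(field ^ 0x80u) - 0x80;
}

/* Sign-extend a 16-bit field to 32 bits. */
int32_t sext16(uint32_t field)
{
    assert(field <= 0xFFFFu);
    return (int32_t)(field ^ 0x8000u) - 0x8000;
}

/* '32to8' direction: one packed word yields four values.
   Byte order (least significant byte first) is an assumption. */
void unpack_32to8(uint32_t word, int32_t out[4], int is_signed)
{
    for (int k = 0; k < 4; k++) {
        uint32_t field = (word >> (8 * k)) & 0xFFu;
        out[k] = is_signed ? sext8(field) : (int32_t)field;
    }
}

/* '8to32' direction: four values are packed into one word,
   keeping only the low 8 bits of each. */
uint32_t pack_8to32(const int32_t in[4])
{
    uint32_t word = 0;
    for (int k = 0; k < 4; k++)
        word |= ((uint32_t)in[k] & 0xFFu) << (8 * k);
    return word;
}
```

The XOR-and-subtract form of sign extension avoids relying on implementation-defined narrowing casts, so the result is the same on any conforming C compiler.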
Input preparation with shift register synthesis. For each IRAM access one of these
modules is generated.
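The idea behind shift register synthesis can be illustrated in software: each element is read from memory once and then moved through a chain of registers, so that several neighbouring elements are available in the same iteration. The following is a minimal sketch under assumptions; the window length `TAPS`, the type `ShiftReg`, and the sliding-window sum used as the consumer are illustrative choices, not the module generated by the compiler.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 8  /* assumed window length */

typedef struct {
    int tap[TAPS];  /* tap[0] is the oldest value, tap[TAPS-1] the newest */
} ShiftReg;

/* Shift in one new value read from the single memory stream. */
void shift_in(ShiftReg *r, int value)
{
    for (size_t k = 0; k + 1 < TAPS; k++)
        r->tap[k] = r->tap[k + 1];
    r->tap[TAPS - 1] = value;
}

/* Sliding-window sum y[i] = x[i] + ... + x[i+TAPS-1].
   Each x[i] is read from memory exactly once.
   Returns the number of outputs written to y. */
size_t window_sum(const int *x, size_t n, int *y)
{
    assert(x != NULL && y != NULL);
    ShiftReg r = { {0} };
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        shift_in(&r, x[i]);       /* one memory read per iteration */
        if (i + 1 >= TAPS) {      /* the register is full from here on */
            int s = 0;
            for (size_t k = 0; k < TAPS; k++)
                s += r.tap[k];
            y[out++] = s;
        }
    }
    return out;
}
```

Without the register chain, the same computation would issue `TAPS` memory reads per output instead of one.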
Fig. 17

A sample picture of size 640 x 480 pixels. Without precautions, loop tiling would
miss the pixels on the borders between the tiles.
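One common precaution is to load each tile together with a one-pixel halo, so that pixels on a tile border still see all of their neighbours. The sketch below assumes a 4-neighbour averaging stencil and a tile edge of 32; both are assumptions for illustration, as are all function and macro names. It computes the same result with and without tiling.

```c
#include <assert.h>

#define IMG_W 640
#define IMG_H 480
#define TILE  32            /* assumed tile edge; divides 640 and 480 */
#define BUF_W (TILE + 2)    /* tile plus one halo pixel on each side */

/* Reference: 4-neighbour average over all interior pixels, untiled. */
void smooth_ref(const float *in, float *out)
{
    for (int y = 1; y < IMG_H - 1; y++)
        for (int x = 1; x < IMG_W - 1; x++)
            out[y * IMG_W + x] =
                0.25f * (in[(y - 1) * IMG_W + x] + in[y * IMG_W + x - 1]
                       + in[(y + 1) * IMG_W + x] + in[y * IMG_W + x + 1]);
}

/* Tiled version: every tile is copied into a local buffer with a halo,
   then processed from that buffer only. */
void smooth_tiled(const float *in, float *out)
{
    static float buf[BUF_W * BUF_W];
    assert(IMG_W % TILE == 0 && IMG_H % TILE == 0);
    for (int ty = 0; ty < IMG_H; ty += TILE)
        for (int tx = 0; tx < IMG_W; tx += TILE) {
            /* Load tile plus halo; positions outside the image get 0. */
            for (int y = -1; y <= TILE; y++)
                for (int x = -1; x <= TILE; x++) {
                    int gy = ty + y, gx = tx + x;
                    int inside = gy >= 0 && gy < IMG_H && gx >= 0 && gx < IMG_W;
                    buf[(y + 1) * BUF_W + (x + 1)] =
                        inside ? in[gy * IMG_W + gx] : 0.0f;
                }
            /* Compute the tile; buffer index (y, x) maps to pixel
               (ty + y - 1, tx + x - 1). Image-edge pixels are skipped. */
            for (int y = 1; y <= TILE; y++)
                for (int x = 1; x <= TILE; x++) {
                    int gy = ty + y - 1, gx = tx + x - 1;
                    if (gy < 1 || gy >= IMG_H - 1 || gx < 1 || gx >= IMG_W - 1)
                        continue;
                    out[gy * IMG_W + gx] =
                        0.25f * (buf[(y - 1) * BUF_W + x] + buf[y * BUF_W + x - 1]
                               + buf[(y + 1) * BUF_W + x] + buf[y * BUF_W + x + 1]);
                }
        }
}
```

If the halo load were dropped and only the `TILE` x `TILE` core were copied, the first and last row and column of every tile could not be computed, which is the border loss the caption refers to.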



Fig. 18

[Figure labels: IRAM0; x[i+8], x[i+7], x[i+6], x[i+5], x[i+4], x[i+3], x[i+2], x[i+1]]
Fig. 19
Fig. 20
Fig. 22:

The visualized array access sequences after optimization. Here the
improvement is evident, since array B is now read following the cache
lines.
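The caption does not say which transformation produces this access pattern. As an illustration only, loop interchange is one well-known way to make a matrix multiplication read B along its rows, and therefore along cache lines in a row-major layout. The matrix size `MAT_N` and the function names are assumptions for this sketch.

```c
#include <assert.h>

#define MAT_N 64  /* assumed matrix size */

/* i-j-k order: b[k*MAT_N + j] is walked with stride MAT_N,
   i.e. down a column, touching a new cache line on almost every access. */
void matmul_ijk(const int *a, const int *b, int *c)
{
    for (int i = 0; i < MAT_N; i++)
        for (int j = 0; j < MAT_N; j++) {
            int s = 0;
            for (int k = 0; k < MAT_N; k++)
                s += a[i * MAT_N + k] * b[k * MAT_N + j];
            c[i * MAT_N + j] = s;
        }
}

/* i-k-j order: the innermost loop walks b and c with stride 1,
   so B is read following the cache lines. */
void matmul_ikj(const int *a, const int *b, int *c)
{
    for (int i = 0; i < MAT_N * MAT_N; i++)
        c[i] = 0;
    for (int i = 0; i < MAT_N; i++)
        for (int k = 0; k < MAT_N; k++) {
            int aik = a[i * MAT_N + k];
            for (int j = 0; j < MAT_N; j++)
                c[i * MAT_N + j] += aik * b[k * MAT_N + j];
        }
}
```

Both functions compute the same product; only the order in which the elements of B are visited differs.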
Fig. 23:

Dataflow graph of matrix multiplication after unroll-and-jam. Counters and address
calculations are omitted.
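For readers unfamiliar with the transformation, unroll-and-jam unrolls an outer loop and fuses the resulting copies of the inner loop into one body. The sketch below applies it to the j loop of a matrix multiplication with an unroll factor of 2. The size `UJ_N`, the choice of loop, the factor, and the function names are assumptions; the figure may use a different loop or factor.

```c
#include <assert.h>

#define UJ_N 8  /* assumed matrix size; must be even for factor 2 */

/* Plain triple loop, for comparison. */
void matmul_plain(const int *a, const int *b, int *c)
{
    for (int i = 0; i < UJ_N; i++)
        for (int j = 0; j < UJ_N; j++) {
            int s = 0;
            for (int k = 0; k < UJ_N; k++)
                s += a[i * UJ_N + k] * b[k * UJ_N + j];
            c[i * UJ_N + j] = s;
        }
}

/* The j loop is unrolled by 2 and both copies of the k loop
   are jammed into a single k loop. */
void matmul_unroll_jam(const int *a, const int *b, int *c)
{
    assert(UJ_N % 2 == 0);
    for (int i = 0; i < UJ_N; i++)
        for (int j = 0; j < UJ_N; j += 2) {
            int s0 = 0, s1 = 0;
            for (int k = 0; k < UJ_N; k++) {
                int aik = a[i * UJ_N + k];          /* loaded once, used twice */
                s0 += aik * b[k * UJ_N + j];
                s1 += aik * b[k * UJ_N + j + 1];
            }
            c[i * UJ_N + j]     = s0;
            c[i * UJ_N + j + 1] = s1;
        }
}
```

The jammed body contains two independent multiply-accumulate chains that share one load of A, which is what exposes more parallel operators to a dataflow mapping.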
Fig. 25: The modified dataflow graph, where unrolling and splitting have been omitted for simplicity
Fig. 26:

Dataflow graph of the MPEG2 inverse quantization for intra coded blocks. The yellow and green blocks were
produced by partial unrolling. The difference is that the green block must not account for the special iteration
value 0. The blue block does the accumulation which alters the value at iteration 64 if necessary.
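The two special cases named in the caption correspond to the DC coefficient (iteration 0) and to mismatch control on the last coefficient. A minimal C sketch of intra inverse quantization, following the usual description in the MPEG-2 video specification, is given below. The function name, parameter names, and the flat 64-element layout are assumptions for illustration and are not taken from the figure.

```c
#include <assert.h>

/* Clamp to the signed 12-bit coefficient range [-2048, 2047]. */
static int saturate(int v)
{
    return v > 2047 ? 2047 : (v < -2048 ? -2048 : v);
}

/* qf: quantized coefficients, w: intra quantizer matrix,
   f: reconstructed coefficients (all in the same scan order). */
void iquant_intra(const int qf[64], const int w[64],
                  int quantiser_scale, int intra_dc_mult, int f[64])
{
    assert(quantiser_scale > 0 && intra_dc_mult > 0);
    int sum = 0;

    /* Iteration 0: the DC coefficient uses its own multiplier. */
    f[0] = saturate(intra_dc_mult * qf[0]);
    sum += f[0];

    /* Iterations 1..63: AC coefficients; '/' truncates toward zero. */
    for (int i = 1; i < 64; i++) {
        f[i] = saturate((2 * qf[i] * w[i] * quantiser_scale) / 32);
        sum += f[i];
    }

    /* Mismatch control: if the sum of all coefficients is even,
       toggle the least significant bit of the last coefficient. */
    if ((sum & 1) == 0) {
        if (f[63] & 1)
            f[63] -= 1;
        else
            f[63] += 1;
    }
}
```

Because the correction depends on the sum over all 64 coefficients, it can only be applied after the last iteration, which matches the separate accumulation block in the dataflow graph.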
Fig. 29
Fig. 31: Data layout transformations in idct configurations
Fig. 32: Dataflow graph of the innermost loop nest.
Fig. 35: Functions of an RDFP
Fig. (number illegible): General Conditional Statement Template
Fig. 49: LEON Architecture Overview
Fig. 50: LEON Pipelined Datapath Structure
Fig. 51: Structure of an XPP device
Fig. 53: LEON-to-XPP dual-clock FIFO



Fig. 55: Computation time of IDCT (8x8)
Fig. 56: MPEG-4 Decoder Block Diagram